
[Figure 3.26 schematic; panel labels: Value, Predict, Backtrack, Recurrent Module, DReLU, Sparse, Dense, Bilinear Constraint, Global minima, Local minima.]

FIGURE 3.26
An illustration of the RBONN framework. Conventional gradient-based algorithms assume that the hidden variables in bilinear models are independent, which causes insufficient training of w because its relationship with A is neglected, as shown in the loss surface (right part). Our RBONN helps w escape from local minima and reach a better solution.

can be considered a bilinear optimization problem with the objective function as

\[
\arg\min_{w,\alpha} \; G(w, \alpha) = \left\| w - \alpha \odot b_w \right\|_2^2 + R(w),
\qquad \text{or} \qquad
\arg\min_{w, A} \; G(w, A) = \left\| b_w - A w \right\|_2^2 + R(w),
\tag{3.121}
\]

where A = diag(1/α_1, · · · , 1/α_N), N is the number of elements in α, ⊙ denotes channel-wise multiplication, and R(·) represents the regularization term, typically the ℓ1 or ℓ2 norm. G(w, A) contains the bilinear form Aw, which is widely used in the field of computer vision [52, 162, 97]. Note that the bilinear function in Eq. 3.121 is Aw rather than G(w, A). Eq. 3.121 is a rational formulation for BNNs, with A and w as bilinearly coupled variables, since w is the variable to be learned and b_w is simply the sign of w.
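To make Eq. 3.121 concrete, the following minimal PyTorch sketch evaluates G(w, A) for a convolutional layer with channel-wise scaling factors. The function name, the tensor shapes, and the choice of an ℓ2 penalty for R(·) are illustrative assumptions rather than the reference implementation of [259].

import torch

def bilinear_objective(w, alpha, reg_weight=1e-4):
    # w:     real-valued weights with N output channels, shape (N, C, K, K)
    # alpha: positive channel-wise scaling factors, shape (N,)
    b_w = torch.sign(w)                          # binary weights b_w = sign(w)
    A = (1.0 / alpha).view(-1, 1, 1, 1)          # diagonal of A = diag(1/alpha_1, ..., 1/alpha_N)
    recon = (b_w - A * w).pow(2).sum()           # ||b_w - A w||_2^2 with channel-wise multiplication
    reg = reg_weight * w.pow(2).sum()            # R(w): here an l2 penalty (assumption)
    return recon + reg

# Usage with random tensors of the assumed shapes; alpha is initialized to the
# channel-wise mean of |w|, a common choice for BNN scaling factors.
w = torch.randn(64, 32, 3, 3, requires_grad=True)
alpha = w.detach().abs().mean(dim=(1, 2, 3))
loss = bilinear_objective(w, alpha)
loss.backward()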

We introduce recurrent bilinear optimization for binary neural networks (RBONN) [259], which learns the coupled scaling factor and the real-valued weights end-to-end. More specifically, the recurrent optimization can efficiently backtrack the weights, which are thereby trained more sufficiently than with conventional methods. To this end, a Density-ReLU (DReLU) is introduced to activate the optimization process based on the density of the variable A. In this way, we obtain a controlled learning process with a backtracking mechanism that accounts for the interaction between the variables, thus avoiding local minima and approaching the performance limit of BNNs, as shown in Fig. 3.26.
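The sketch below illustrates the density-gated backtracking idea: a density statistic of A decides whether w is pulled back toward its previous state. Both the density criterion and the update rule here are simplified assumptions for illustration; the actual DReLU and recurrent module of RBONN [259] are defined differently.

import torch

def is_dense(A_diag, threshold=0.5):
    # Illustrative density measure: the fraction of channels whose magnitude
    # exceeds the mean magnitude of A (not the exact criterion of [259]).
    mag = A_diag.abs()
    return ((mag > mag.mean()).float().mean() > threshold).item()

def drelu_backtrack(w, w_prev, A_diag, beta=0.1):
    # Density-gated backtracking: when A is "dense", pull the slower variable w
    # back toward its previous state so that it keeps being trained; otherwise
    # leave the standard gradient update untouched.
    if is_dense(A_diag):
        return w + beta * (w_prev - w)
    return w

# Example (hypothetical shapes): gate the update of a (64, 32, 3, 3) weight
# tensor by the density of A, where A_diag = 1 / alpha has shape (64,).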

However, such bilinear constraints lead to an asynchronous convergence problem that directly affects the learning of A and w: the variable with the slower convergence speed (usually w) is trained less sufficiently than the faster one. Moreover, BNNs rely on nonconvex optimization and therefore suffer even more from local minima under such asynchronous convergence. A telling example is that w tends to fall into a local optimum with low magnitude when the magnitude of A is much larger than 0 (due to b_w ∈ {−1, +1}). Conversely, w retains a large magnitude and thus converges slowly when the elements of A are close to 0.
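A one-step gradient computation, ignoring R(w) and treating b_w and A as fixed (our illustration, not a derivation from [259]), makes this asymmetry concrete:

\[
\frac{\partial}{\partial w} \left\| b_w - A w \right\|_2^2 = -2 A \left( b_w - A w \right),
\]

so the per-channel stationary point is w_i = b_{w,i}/A_i = α_i b_{w,i}. When A_i is much larger than 0 (i.e., α_i is small), w is driven toward a low-magnitude solution, whereas when A_i ≈ 0 the gradient itself is scaled by A_i and nearly vanishes, so w converges slowly.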